Learning Translations from Comparable Corpora
Abstract
This thesis examines the possibility of using comparable corpora to augment statistical models of translation. Treating comparable corpora as marginal samples from an aligned bilingual joint distribution, the estimation of translation models from a combination of bilingual parallel and comparable corpora is seen as a variation of the labelled-unlabelled problem [Seeger, 2000b]. Results on synthetic data confirm that successful re-estimation within the EM framework [Dempster et al., 1977] is highly dependent on the balance between complete and incomplete data [Nigam, 2001]. Here we show that the utility of re-estimation with additional incomplete data is also highly dependent on the accuracy of the initial parameters estimated from the complete data alone. We propose a method for constraining the re-estimation procedure in relation to the degree of comparability between marginal samples. This is seen to result in better conditional models when the assumption of comparability is valid. Finally, we consider how more complex marginal models could be used to further constrain the re-estimation of the conditional.
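To make the labelled-unlabelled framing concrete, the following is a minimal sketch of EM re-estimation for a discrete joint table p(x, y). It is not the thesis's actual model. Paired counts stand in for parallel (complete) data, and separate counts over x and over y stand in for comparable (incomplete, marginal) data. The function name `em_joint`, the `weight` parameter that down-weights the incomplete data, and the `smooth` constant are all assumptions introduced here for illustration.

```python
import numpy as np

def em_joint(paired, x_only, y_only, weight=1.0, iters=50, smooth=1e-3):
    """Estimate a joint table p(x, y) from paired counts plus marginal counts.

    paired : (K, M) counts of jointly observed (x, y) pairs (complete data).
    x_only : (K,) counts of x observed without y (incomplete data).
    y_only : (M,) counts of y observed without x (incomplete data).
    weight : hypothetical factor scaling the influence of incomplete data.
    """
    paired = np.asarray(paired, dtype=float)
    x_only = np.asarray(x_only, dtype=float)
    y_only = np.asarray(y_only, dtype=float)

    # Initialise from the complete data alone, lightly smoothed.
    joint = paired + smooth
    joint /= joint.sum()

    for _ in range(iters):
        # E-step: spread each marginal count over the unobserved variable
        # using the current conditionals p(y | x) and p(x | y).
        p_y_given_x = joint / joint.sum(axis=1, keepdims=True)
        p_x_given_y = joint / joint.sum(axis=0, keepdims=True)
        expected = x_only[:, None] * p_y_given_x + y_only[None, :] * p_x_given_y

        # M-step: renormalise observed counts plus weighted expected counts.
        # The smoothing term keeps every cell strictly positive.
        joint = paired + weight * expected + smooth
        joint /= joint.sum()

    return joint

# Illustrative usage with made-up counts.
paired = np.array([[8.0, 2.0], [1.0, 9.0]])
joint = em_joint(paired, x_only=[20, 20], y_only=[20, 20], weight=0.5)
conditional = joint / joint.sum(axis=1, keepdims=True)  # p(y | x)
```

Setting `weight=0.0` recovers the estimate from the complete data alone, which makes it easy to compare against the re-estimated table and see how much the incomplete data moves the conditional.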
Similar resources
Mining New Word Translations from Comparable Corpora
New words such as names, technical terms, etc., appear frequently. As such, the bilingual lexicon of a machine translation system has to be constantly updated with these new word translations. Comparable corpora such as news documents of the same period from different news agencies are readily available. In this paper, we present a new approach to mining new word translations from comparable corp...
Bootstrapping Entity Translation on Weakly Comparable Corpora
This paper studies the problem of mining named entity translations from comparable corpora with some “asymmetry”. Unlike the previous approaches relying on the “symmetry” found in parallel corpora, the proposed method is tolerant to asymmetry often found in comparable corpora, by distinguishing different semantics of relations of entity pairs to selectively propagate seed entity translations on...
Rare Word Translation Extraction from Aligned Comparable Documents
We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obta...
Identification of Fertile Translations in Medical Comparable Corpora: a Morpho-Compositional Approach
This paper defines a method for lexicon in the biomedical domain from comparable corpora. The method is based on compositional translation and exploits morpheme-level translation equivalences. It can generate translations for a large variety of morphologically constructed words and can also generate ’fertile’ translations. We show that fertile translations increase the overall quality of the ex...
Utilizing Citations of Foreign Words in Corpus-Based Dictionary Generation
Previous work concerned with the identification of word translations from text collections has been either based on parallel or on comparable corpora of the respective languages. In the case of comparable corpora basic dictionaries have been necessary to form a bridge between the languages under consideration. We present here a novel approach to identify word translations from a single monoling...